This is intended as a soundness check on work to address issue #7606.
## Goal

To have a node (ssh for now, but applicable to other services as well) notice when the proxy configuration changes and adapt automatically.
## Context

Imagine you have a cluster with a node connected in via a tunnel on port 3024. Now imagine you change the proxy config so that `tunnel_public_address` is `example.com:4024`, and you either restart the proxy or reload the proxy config with a `SIGHUP`...

...and the node doesn't reconnect to the proxy, because even though the `auth_server` address hasn't changed, the node has cached the old `tunnel_public_address` and keeps trying to connect to that.

You can always restart the node to have it reconnect, but that would be a pain if you have thousands of nodes.
## This PR's approach

I've initially attacked this by trying to leverage the discovery & re-connection machinery that already exists in the node (i.e. the `AgentPool`). Originally, the agent pool was given the `tunnel_public_address` the node discovered on startup and always attempted to reconnect to that.

The WIP code in this PR removes that static config and replaces it with a callback that furnishes the `AgentPool` with a list of potential proxy addresses (roughly sketched below). In the proof-of-concept handler it polls `webapi/find` to get the current `tunnel_public_address` value, and adds a new `Agent` to the `AgentPool` for that address if necessary. This is based on the assumption that once we add it to the agent pool, a re-connection will automatically fall out of the existing machinery.

This works as expected up to a point...
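For concreteness, here is a rough sketch of the callback shape described above, assuming a resolver that polls `webapi/find`. Every name here (`ProxyAddrResolver`, `newFindResolver`, the JSON field layout) is a hypothetical illustration, not the actual WIP code or Teleport's real types:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// findResponse mirrors just the part of the webapi/find payload we care
// about here; the field names are illustrative, not Teleport's exact wire
// format.
type findResponse struct {
	Proxy struct {
		SSH struct {
			TunnelPublicAddr string `json:"tunnel_public_addr"`
		} `json:"ssh"`
	} `json:"proxy"`
}

// ProxyAddrResolver is the hypothetical callback handed to the AgentPool:
// instead of a single static tunnel address captured at startup, the pool
// asks for the current candidate list whenever it needs to (re)connect.
type ProxyAddrResolver func(ctx context.Context) ([]string, error)

// newFindResolver builds a resolver that polls webapi/find on the proxy's
// web address and returns whatever tunnel address it currently advertises.
func newFindResolver(webProxyAddr string) ProxyAddrResolver {
	client := &http.Client{Timeout: 5 * time.Second}
	return func(ctx context.Context) ([]string, error) {
		url := fmt.Sprintf("https://%s/webapi/find", webProxyAddr)
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()

		var find findResponse
		if err := json.NewDecoder(resp.Body).Decode(&find); err != nil {
			return nil, err
		}
		return []string{find.Proxy.SSH.TunnelPublicAddr}, nil
	}
}

func main() {
	// Hypothetical usage: resolve against the proxy's web address and print
	// the currently advertised tunnel address(es).
	resolve := newFindResolver("proxy.example.com:3080")
	addrs, err := resolve(context.Background())
	if err != nil {
		fmt.Println("poll failed:", err)
		return
	}
	fmt.Println("current tunnel addresses:", addrs)
}
```

The idea is that the `AgentPool` invokes the resolver whenever it wants to (re)establish tunnels, diffs the returned addresses against the agents it already tracks, and spawns a new `Agent` for any address it isn't covering yet.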
### ...but
It turns out there is a bunch of stuff in the `AgentPool` that assumes the original `tunnel_public_address` is still available (see `Agent.AccessPoint`) in order to process the new connection. This makes perfect sense in the context of the proxy Discovery, in that it's designed to handle locating and adding extra proxies to the pool. It models the notion of moving a tunnel proxy to a new address somewhat less well.

This extends past the `AgentPool` as well. There are several places in the larger application that have a baked-in assumption that the `Client` wrapping the initial connection to the auth server is special, and that the corresponding tunnel parameters will be valid (if not actually connected) for the lifetime of the process.

We can, of course, track down all of the state information that needs updating and do so, but it is beginning to sound like I am making this change at the wrong level.
## Alternative approaches

### Tunnel watchdog
We could somehow detect the address change (e.g. by polling `webapi/find` occasionally and noting any changes in the reported addresses) and automatically restart the entire teleport instance on a change to the `tunnel_public_address`, rather than trying to manage the change internally.

This is starting to look like a sensible option, as so much of the teleport process assumes that the initial tunnel location will not change. I've started working on a proof-of-concept implementation that polls `webapi/find` (a rough sketch is below), but this becomes problematic when a cluster might have multiple, independently configured proxies that can be hit by any given poll. Some extra work would need to be done to prevent the SSH node from restarting while in a perfectly legitimate configuration.
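As a minimal sketch of the watchdog idea, reusing the hypothetical `ProxyAddrResolver` from the earlier sketch; `restartProcess` is a stand-in for whatever restart hook the teleport process would actually expose, not a real API:

```go
// watchTunnelAddr polls the resolver and, if the advertised tunnel
// addresses change between successful polls, asks for a process restart.
func watchTunnelAddr(ctx context.Context, resolve ProxyAddrResolver, restartProcess func()) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	var lastAddrs []string
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			addrs, err := resolve(ctx)
			if err != nil {
				// Transient polling errors shouldn't trigger a restart.
				continue
			}
			if lastAddrs != nil && !sameAddrs(lastAddrs, addrs) {
				restartProcess()
				return
			}
			lastAddrs = addrs
		}
	}
}

// sameAddrs reports whether two address lists contain the same entries,
// ignoring order.
func sameAddrs(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	seen := make(map[string]bool, len(a))
	for _, addr := range a {
		seen[addr] = true
	}
	for _, addr := range b {
		if !seen[addr] {
			return false
		}
	}
	return true
}
```

The naive comparison in `sameAddrs` is exactly where the multi-proxy problem bites: two successive polls can land on differently configured proxies and report different addresses even though nothing has actually changed.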
### Refactoring out

A larger-scale refactor of the `AgentPool` (and friends) to try and make it easier to handle this "proxy moved" situation. This would essentially mean that the initial client connection to the `proxy` becomes just another client in the pool, and can be rotated out of the `AgentPool`'s proxy set like any other...?
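Purely as a strawman for what "just another client in the pool" might look like at the interface level (nothing here exists today; the method names are invented, and it again leans on the hypothetical `ProxyAddrResolver` from above):

```go
// RefactoredAgentPool sketches a post-refactor surface in which the pool
// owns every reverse tunnel connection, including the one established at
// startup. Nothing outside the pool holds a "special" initial client, so a
// proxy moving to a new tunnel address is just a normal membership change.
type RefactoredAgentPool interface {
	// Start begins maintaining agents against the current proxy set, using
	// the resolver to discover candidate tunnel addresses.
	Start(ctx context.Context, resolve ProxyAddrResolver) error

	// SetProxies replaces the candidate proxy set; agents pointing at
	// addresses no longer in the set are drained and removed.
	SetProxies(addrs []string)

	// Stop tears down all agents, including the original one.
	Stop()
}
```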
## What do I want out of this discussion?

Basically, confirmation of whether the `AgentPool` is or is not the correct place to try and implement this change. And, if it's not, some alternative approaches to the problem.